This tutorial will guide you through implementing a recurrent neural network that classifies IMDB movie reviews as positive or negative.
The IMDB dataset consists of 25,000 reviews, each with a binary label (1 = positive, 0 = negative). Here is an example review:
“Okay, sorry, but I loved this movie. I just love the whole 80’s genre of these kind of movies, because you don’t see many like this...” -~CupidGrl~
The dataset contains a large vocabulary, and reviews vary in length from tens to hundreds of words. We reduce the complexity of the dataset in two steps:
- Limit the vocabulary to vocab_size = 20000 words by replacing the less frequent words with an Out-of-Vocab (OOV) token.
- Truncate or pad each review to max_len = 128 words.
We have already done this preprocessing and saved the data in a pickle file: imdb_data.pkl.
The needed file can be downloaded from https://s3-us-west-1.amazonaws.com/nervana-course/imdb_data.pkl and placed in the data directory.
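The preprocessing described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the script that produced imdb_data.pkl; in particular, the OOV index, the padding index, and the word-to-index dictionary here are assumptions made for the example:

```python
import numpy as np

def preprocess(reviews, word_to_index, vocab_size=20000, max_len=128,
               oov_idx=1, pad_idx=0):
    """Map tokenized reviews to fixed-length index arrays.

    Words that are unknown, or whose index falls outside the vocab_size
    most frequent words, become the OOV token; reviews longer than max_len
    are truncated, and shorter ones are padded (oov_idx and pad_idx are
    illustrative conventions, not necessarily those used in imdb_data.pkl).
    """
    X = np.full((len(reviews), max_len), pad_idx, dtype=np.int32)
    for i, words in enumerate(reviews):
        idxs = [word_to_index.get(w, oov_idx) for w in words]
        idxs = [ix if ix < vocab_size else oov_idx for ix in idxs]
        X[i, :min(len(idxs), max_len)] = idxs[:max_len]
    return X

# Toy vocabulary: 'rareword' sits beyond the 20000-word cutoff.
vocab = {'love': 2, 'movie': 3, 'rareword': 25000}
X = preprocess([['love', 'movie', 'rareword', 'unseen']], vocab, max_len=8)
```

Both the rare word and the unseen word collapse to the OOV token, and the short review is padded out to max_len.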
In [ ]:
import pickle as pkl
data = pkl.load(open('data/imdb_data.pkl', 'rb'))  # open in binary mode for pickle
The data dictionary contains four numpy arrays for the data:
- data['X_train'] is an array with shape (20009, 128): 20009 training reviews, each with up to 128 words.
- data['Y_train'] is an array with shape (20009, 1): a target label (positive=1, negative=0) for each review.
- data['X_valid'] is an array with shape (4991, 128) for the 4991 examples in the validation set.
- data['Y_valid'] is an array with shape (4991, 1) with the labels for the validation set.
In [ ]:
print(data['X_train'].shape)
In [ ]:
from neon.backends import gen_backend
be = gen_backend(backend='gpu', batch_size=128)  # use backend='cpu' if no compatible GPU is available
To train the model, we use neon's ArrayIterator object, which iterates over these numpy arrays and returns a minibatch of data with each call, ready to pass to the model.
In [ ]:
from neon.data import ArrayIterator
import numpy as np
data['Y_train'] = np.array(data['Y_train'], dtype=np.int32)
data['Y_valid'] = np.array(data['Y_valid'], dtype=np.int32)
train_set = ArrayIterator(data['X_train'], data['Y_train'], nclass=2)
valid_set = ArrayIterator(data['X_valid'], data['Y_valid'], nclass=2)
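Conceptually, ArrayIterator does something like the following pure-NumPy loop. This is a simplified sketch: neon's version also moves data onto the backend and one-hot encodes the labels, and how partial final batches are handled here (dropped) is an assumption of the sketch:

```python
import numpy as np

def minibatches(X, y, batch_size=128):
    """Yield consecutive (X, y) minibatches from in-memory arrays.

    The last partial batch is dropped for simplicity (an assumption of
    this sketch, not necessarily neon's behavior).
    """
    for start in range(0, X.shape[0] - batch_size + 1, batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

# Tiny example: 10 rows split into minibatches of 4.
X = np.arange(20).reshape(10, 2)
y = np.arange(10).reshape(10, 1)
batches = list(minibatches(X, y, batch_size=4))
```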
In [ ]:
from neon.initializers import Uniform, GlorotUniform
init_glorot = GlorotUniform()
init_uniform = Uniform(-0.1/128, 0.1/128)
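For reference, Glorot (Xavier) uniform initialization draws weights from U(-l, l) with l = sqrt(6 / (fan_in + fan_out)), which keeps activation variance roughly constant across layers. A quick sketch (the fan sizes below are illustrative):

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng=None):
    """Sample a (fan_in, fan_out) weight matrix from the Glorot uniform range."""
    rng = rng or np.random.default_rng(0)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# For the 128x128 LSTM weights below, the range works out to about +/-0.153.
W = glorot_uniform(128, 128)
```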
The network consists of a sequential list of the following layers:
- LookupTable is a word embedding that maps from a sparse one-hot representation to dense word vectors. The embedding is learned from the data.
- LSTM is a recurrent layer with "long short-term memory" units. LSTM networks are good at learning temporal dependencies during training, and often perform better than standard RNN layers.
- RecurrentSum is a recurrent output layer that collapses over the time dimension of the LSTM by summing outputs from individual steps.
- Dropout performs regularization by silencing a random subset of the units during training.
- Affine is a fully connected layer for the binary classification of the outputs.
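The first and third of these layers are easy to emulate in NumPy: LookupTable is a row lookup into an embedding matrix, and RecurrentSum collapses the time axis by summation. A toy sketch (the sizes and indices are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(20000, 128))   # embedding table: vocab_size x embedding_dim
review = np.array([5, 42, 7, 0])      # one review as word indices (toy example)

vectors = emb[review]                 # LookupTable: rows of emb, shape (time, embedding_dim)
summed = vectors.sum(axis=0)          # RecurrentSum: collapse the time dimension
```

In the real model the LSTM sits between these two steps, so what gets summed are the LSTM's per-timestep outputs rather than the raw embeddings.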
In [ ]:
from neon.layers import LSTM, Affine, Dropout, LookupTable, RecurrentSum
from neon.transforms import Logistic, Tanh, Softmax
from neon.models import Model
layers = [
LookupTable(vocab_size=20000, embedding_dim=128, init=init_uniform),
LSTM(output_size=128, init=init_glorot, activation=Tanh(),
gate_activation=Logistic(), reset_cells=True),
RecurrentSum(),
Dropout(keep=0.5),
Affine(nout=2, init=init_glorot, bias=init_glorot, activation=Softmax())
]
# create model object
model = Model(layers=layers)
In [ ]:
from neon.optimizers import Adagrad
from neon.transforms import CrossEntropyMulti
from neon.layers import GeneralizedCost
cost = GeneralizedCost(costfunc=CrossEntropyMulti(usebits=True))
optimizer = Adagrad(learning_rate=0.01)
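Adagrad adapts each parameter's step size by dividing the learning rate by the square root of that parameter's accumulated squared gradients, so frequently-updated parameters take smaller steps. A minimal sketch of one update (the epsilon value is an assumption for numerical stability, not neon's exact default):

```python
import numpy as np

def adagrad_step(param, grad, accum, lr=0.01, eps=1e-8):
    """One Adagrad update: accumulate grad**2, then scale the step per parameter."""
    accum += grad ** 2
    param -= lr * grad / (np.sqrt(accum) + eps)
    return param, accum

w = np.array([1.0, 1.0])
state = np.zeros(2)  # running sum of squared gradients
w, state = adagrad_step(w, np.array([0.5, 2.0]), state)
```

On the very first step both parameters move by about lr regardless of gradient magnitude; the per-parameter scaling only diverges as gradient history accumulates.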
Callbacks allow the model to report its progress during the course of training. Here we tell neon to evaluate on the validation set and to serialize (save) the model every epoch.
In [ ]:
from neon.callbacks.callbacks import Callbacks
model_file = 'imdb_lstm.pkl'
callbacks = Callbacks(model, eval_set=valid_set, serialize=1, save_path=model_file)
In [ ]:
model.fit(train_set, optimizer=optimizer, num_epochs=2,
cost=cost, callbacks=callbacks)
In [ ]:
from neon.transforms import Accuracy
print("Test Accuracy - {}".format(100 * model.eval(valid_set, metric=Accuracy())))
print("Train Accuracy - {}".format(100 * model.eval(train_set, metric=Accuracy())))
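The Accuracy metric is just the fraction of reviews for which the argmax over the two softmax outputs matches the label. As a sketch (the toy probabilities and labels below are made up):

```python
import numpy as np

def accuracy(probs, labels):
    """Fraction of examples where the predicted class equals the label."""
    preds = probs.argmax(axis=1)           # predicted class per example
    return (preds == labels.ravel()).mean()

# Four examples: the model gets three of the four labels right.
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
labels = np.array([[0], [1], [1], [1]])
acc = accuracy(probs, labels)
```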